Exploring metadata of the human vaginal microbiome

Group 6

Alberte Englund
Mathilde Due
Line Winther Gormsen
Sigrid Frandsen
Kristine Johansen

STUDY DESCRIPTION

Metadata from MGnify’s vaginal microbiome genome catalogue 1

  • Uncover patterns in genome quality, taxonomic composition, and ecological characteristics.

  • Identify potential diagnostic patterns for endometriosis based on associated pathogens of the vaginal microbiota, at genus level:

    • Anaerococcus, Ureaplasma, Gardnerella, Veillonella, Corynebacterium, Peptoniphilus, Candida, Alloscardovia 2

DATA CLEANING AND WRANGLING

Untidy → tidy data

  1. Split the “Lineage” variable into seven variables, one per taxonomic rank (Domain → Species).
  2. Strip rank prefixes (e.g. “d__”) and GTDB suffixes (e.g. “_A”) to streamline the taxonomy.
  3. Each variable occupies a column, and each observation occupies a row.
  4. Convert all “not provided” values and empty strings to NA.
  5. Remove columns that will not be used in our analysis.
  6. Classify MAG quality (completeness, contamination, and overall genome quality as high, medium, or low) and left-join it to the tidy data.
  7. Add a column to the MAG dataset that flags each endometriosis-associated genus (Yes/No).
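Steps 1, 2 and 4 can be sketched roughly as below. This is a minimal illustration on two toy rows, not the project's actual script: the toy lineages, the `ranks` vector and the regular expressions are assumptions made for the example.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy stand-in for the raw metadata (hypothetical rows, not from the dataset)
raw <- tibble::tibble(
  Genome  = c("toy_1", "toy_2"),
  Lineage = c(
    "d__Bacteria;p__Bacillota_A;c__Clostridia;o__Tissierellales;f__Peptoniphilaceae;g__Peptoniphilus_B;s__",
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella bivia"
  ),
  Country = c("not provided", "Denmark")
)

# Assumed column names for the seven taxonomic ranks
ranks <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")

tidy <- raw |>
  # Step 1: one column per rank
  tidyr::separate(Lineage, into = ranks, sep = ";") |>
  dplyr::mutate(
    # Step 2a: drop rank prefixes such as "d__" or "p__"
    dplyr::across(dplyr::all_of(ranks), \(x) stringr::str_remove(x, "^[a-z]__")),
    # Step 2b: drop GTDB suffixes such as "_A" (assumed pattern: underscore + capitals)
    dplyr::across(dplyr::all_of(ranks), \(x) stringr::str_remove_all(x, "_[A-Z]+\\b")),
    # Step 4: empty strings and "not provided" become NA
    dplyr::across(
      dplyr::where(is.character),
      \(x) dplyr::na_if(dplyr::na_if(x, ""), "not provided")
    )
  )

print(tidy)
```

The suffix pattern is deliberately narrow so that species epithets such as "Prevotella bivia" are left untouched.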
print(readr::read_tsv(here::here("data/_raw/genomes-all_metadata.tsv")))
# A tibble: 618 × 20
   Genome        Genome_type  Length N_contigs    N50 GC_content Completeness
   <chr>         <chr>         <dbl>     <dbl>  <dbl>      <dbl>        <dbl>
 1 MGYG000303700 MAG          678213         2 466332       47.8         63.7
 2 MGYG000303701 MAG         1500176        18 112881       42.4         87.8
 3 MGYG000303702 MAG         1210062        44  48790       26.4         94.8
 4 MGYG000303703 MAG         1706016        27  89653       44.6         93.7
 5 MGYG000303704 MAG          703182         7 111709       47.8         63.7
 6 MGYG000303705 MAG         2542045       112  34925       48           97.9
 7 MGYG000303706 MAG         1449687       185  10153       34.8         85.2
 8 MGYG000303707 MAG         1874692        90  28768       37.1         99.0
 9 MGYG000303708 MAG         1480380        12 169949       42.2         87.6
10 MGYG000303709 MAG          694644        57  15063       47.9         62.0
# ℹ 608 more rows
# ℹ 13 more variables: Contamination <dbl>, rRNA_5S <dbl>, rRNA_16S <dbl>,
#   rRNA_23S <dbl>, tRNAs <dbl>, Genome_accession <chr>, Species_rep <chr>,
#   Lineage <chr>, Sample_accession <chr>, Study_accession <chr>,
#   Country <chr>, Continent <chr>, FTP_download <chr>
untidy_data <- readr::read_tsv(
  here::here("data/_raw/genomes-all_metadata.tsv"))
print(
  untidy_data |>
    dplyr::select(Lineage))
# A tibble: 618 × 1
   Lineage                                                                      
   <chr>                                                                        
 1 d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__Saccharimonadales;f__Na…
 2 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Saccharofermentanales;f__Fastidi…
 3 d__Bacteria;p__Bacillota;c__Bacilli;o__Staphylococcales;f__Gemellaceae;g__Ge…
 4 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Saccharofermentanales;f__Fastidi…
 5 d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__Saccharimonadales;f__Na…
 6 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidacea…
 7 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Tissierellales;f__Peptoniphilace…
 8 d__Bacteria;p__Bacillota;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g…
 9 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Saccharofermentanales;f__Fastidi…
10 d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__Saccharimonadales;f__Na…
# ℹ 608 more rows
print(readr::read_tsv(here::here("data/02_dat_clean.tsv")))
# A tibble: 618 × 21
   Genome        Genome_type  Length N_contigs    N50 GC_content Completeness
   <chr>         <chr>         <dbl>     <dbl>  <dbl>      <dbl>        <dbl>
 1 MGYG000303700 MAG          678213         2 466332       47.8         63.7
 2 MGYG000303701 MAG         1500176        18 112881       42.4         87.8
 3 MGYG000303702 MAG         1210062        44  48790       26.4         94.8
 4 MGYG000303703 MAG         1706016        27  89653       44.6         93.7
 5 MGYG000303704 MAG          703182         7 111709       47.8         63.7
 6 MGYG000303705 MAG         2542045       112  34925       48           97.9
 7 MGYG000303706 MAG         1449687       185  10153       34.8         85.2
 8 MGYG000303707 MAG         1874692        90  28768       37.1         99.0
 9 MGYG000303708 MAG         1480380        12 169949       42.2         87.6
10 MGYG000303709 MAG          694644        57  15063       47.9         62.0
# ℹ 608 more rows
# ℹ 14 more variables: Contamination <dbl>, rRNA_5S <dbl>, rRNA_16S <dbl>,
#   rRNA_23S <dbl>, tRNAs <dbl>, Country <chr>, Continent <chr>, Domain <chr>,
#   Phylum <chr>, Class <chr>, Order <chr>, Family <chr>, Genus <chr>,
#   Species <chr>
tidy_data <- readr::read_tsv(
  here::here("data/02_dat_clean.tsv"))
print(
  tidy_data |>
    dplyr::select(Domain, Phylum, Class, Order, Family, Genus, Species))
# A tibble: 618 × 7
   Domain   Phylum          Class           Order           Family Genus Species
   <chr>    <chr>           <chr>           <chr>           <chr>  <chr> <chr>  
 1 Bacteria Patescibacteria Saccharimonadia Saccharimonada… Nanop… Nano… Nanope…
 2 Bacteria Bacillota       Clostridia      Saccharofermen… Fasti… KA00… KA0027…
 3 Bacteria Bacillota       Bacilli         Staphylococcal… Gemel… Geme… Gemell…
 4 Bacteria Bacillota       Clostridia      Saccharofermen… Fasti… Mage… Mageei…
 5 Bacteria Patescibacteria Saccharimonadia Saccharimonada… Nanop… Nano… Nanope…
 6 Bacteria Bacteroidota    Bacteroidia     Bacteroidales   Bacte… Prev… <NA>   
 7 Bacteria Bacillota       Clostridia      Tissierellales  Pepto… Pept… Pepton…
 8 Bacteria Bacillota       Bacilli         Lactobacillales Lacto… Lact… Lactob…
 9 Bacteria Bacillota       Clostridia      Saccharofermen… Fasti… KA00… KA0027…
10 Bacteria Patescibacteria Saccharimonadia Saccharimonada… Nanop… Nano… Nanope…
# ℹ 608 more rows
options(width = 200)
Aug_data <- readr::read_tsv(
  here::here("data/03_dat_aug.tsv"))
print(
  Aug_data |>
    dplyr::select(Completeness_quality, Contamination_quality,
                  Overall_quality, endometriosis_associated),
  width = Inf)
# A tibble: 618 × 4
   Completeness_quality Contamination_quality Overall_quality endometriosis_associated
   <chr>                <chr>                 <chr>           <chr>                   
 1 Medium               High                  Medium          No                      
 2 Medium               High                  Medium          No                      
 3 High                 High                  High            No                      
 4 High                 High                  High            No                      
 5 Medium               High                  Medium          No                      
 6 High                 High                  High            No                      
 7 Medium               High                  Medium          Yes                     
 8 High                 High                  High            No                      
 9 Medium               High                  Medium          No                      
10 Medium               High                  Medium          No                      
# ℹ 608 more rows
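
The quality classes and the endometriosis flag shown above could be derived along these lines. This is a sketch under stated assumptions: only the high-quality thresholds (above 90% completeness, below 5% contamination) come from the slides; the medium/low cut-offs and the rule that overall quality is the worse of the two classes are assumptions, and the toy rows are hypothetical.

```r
library(dplyr)

# Genera listed in the study description
endo_genera <- c("Anaerococcus", "Ureaplasma", "Gardnerella", "Veillonella",
                 "Corynebacterium", "Peptoniphilus", "Candida", "Alloscardovia")

# Toy stand-in for the tidy data (hypothetical values)
mags <- tibble::tibble(
  Genome        = c("toy_1", "toy_2", "toy_3"),
  Completeness  = c(63.7, 94.8, 45.0),
  Contamination = c(0.5, 1.2, 12.0),
  Genus         = c("Peptoniphilus", "Lactobacillus", NA)
)

quality_levels <- c("Low", "Medium", "High")

aug <- mags |>
  mutate(
    Completeness_quality = case_when(
      Completeness > 90  ~ "High",
      Completeness >= 50 ~ "Medium",  # assumed cut-off
      TRUE               ~ "Low"
    ),
    Contamination_quality = case_when(
      Contamination < 5  ~ "High",
      Contamination < 10 ~ "Medium",  # assumed cut-off
      TRUE               ~ "Low"
    ),
    # Assumption: overall quality is the worse of the two classes
    Overall_quality = quality_levels[pmin(
      match(Completeness_quality, quality_levels),
      match(Contamination_quality, quality_levels)
    )],
    # A missing genus is treated as not associated
    endometriosis_associated = if_else(Genus %in% endo_genera, "Yes", "No")
  )

print(aug, width = Inf)
```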

PROJECT ORGANISATION BY FLOWCHART


DATA DESCRIPTION

  • 618 vaginal metagenome-assembled genomes (MAGs)
  • 25 variables covering taxonomy, assembly quality, and geography
  • High completeness and low contamination for most genomes

Most MAGs belong to only a few dominant phyla.
This indicates strong taxonomic skew in the dataset.


Genome lengths fall within the expected biological range
for vaginal bacterial taxa (typically 1.5–3 Mb).


The majority of the data in this study are from the USA and China. This creates an uneven geographic distribution, which limits how confidently we can compare countries.

ANALYSIS 1 - Phylogenetic tree

  • Constructed a phylogenetic tree from the taxonomic ranks (Domain → Species) in the augmented dataset.
  • NA values in taxonomy were removed before tree construction.
  • The tree is colored by phylum to show taxonomic clustering.

ANALYSIS 2 - Data quality

Investigating the quality of the MAGs.

  • Scatterplot of completeness (%) vs. contamination (%) for all MAGs
  • The dashed lines mark the thresholds for high-quality genomes (above 90% completeness and below 5% contamination)
  • Points are colored by phylum
  • Most MAGs cluster in the high-quality area, indicating good assembly quality.
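
A plot of this kind could be produced roughly as follows. The data frame holds toy values rather than the real MAGs, and the styling choices are assumptions; only the two thresholds and the colouring by phylum are taken from the bullets above.

```r
library(ggplot2)

# Toy values, not taken from the dataset
mags <- data.frame(
  Completeness  = c(63.7, 87.8, 94.8, 97.9, 99.0),
  Contamination = c(0.4, 1.1, 0.9, 6.2, 0.2),
  Phylum        = c("Patescibacteria", "Bacillota", "Bacillota",
                    "Bacteroidota", "Bacillota")
)

p <- ggplot(mags, aes(x = Completeness, y = Contamination, colour = Phylum)) +
  geom_point(alpha = 0.7) +
  # Dashed lines at the high-quality thresholds
  geom_vline(xintercept = 90, linetype = "dashed") +
  geom_hline(yintercept = 5, linetype = "dashed") +
  labs(x = "Completeness (%)", y = "Contamination (%)")

# Number of MAGs inside the high-quality area
n_high <- sum(mags$Completeness > 90 & mags$Contamination < 5)
```

Printing `p` draws the scatterplot; `n_high` gives a quick numeric check of how many points fall in the high-quality area.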

ANALYSIS 3 - Endometriosis-associated and non-associated MAGs

  • Compare endometriosis-associated vs non-associated MAGs
  • Assess phylum distribution and GC content
  • Investigate if associated MAGs form a distinct genomic group
  • Associated MAGs appear in only a few phyla

  • Most phyla contain no associated MAGs → Association is genus-specific, not phylum-wide


  • GC% distributions overlap almost fully

  • No clear GC signature of association → No genomic differentiation
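
The two comparisons can be summarised with a few lines of dplyr. This is a sketch on hypothetical toy rows; the column names follow the augmented dataset shown earlier, and the choice of the median as summary statistic is an assumption.

```r
library(dplyr)

# Toy stand-in for the augmented data (hypothetical values)
aug <- tibble::tibble(
  Phylum     = c("Bacillota", "Bacillota", "Actinomycetota",
                 "Bacteroidota", "Actinomycetota"),
  GC_content = c(34.8, 37.1, 42.0, 48.0, 41.5),
  endometriosis_associated = c("Yes", "No", "Yes", "No", "No")
)

# Which phyla contain associated MAGs at all?
phylum_counts <- aug |>
  count(Phylum, endometriosis_associated)

# GC content per group
gc_summary <- aug |>
  group_by(endometriosis_associated) |>
  summarise(n = n(), median_GC = median(GC_content), .groups = "drop")

print(phylum_counts)
print(gc_summary)
```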

ANALYSIS 4 - Taxa Distribution between countries

  • Investigating the distribution of taxa across countries.
  • Removed rows with NA in Country.
  • Counted taxon occurrences for each country.
  • Converted to wide format.
  • Large differences in sample size → normalized to proportions.
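
The steps above can be sketched as follows, here at phylum level on hypothetical toy rows. Note one deviation from the listed order: the sketch normalizes within each country before widening, which gives the same proportions.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the augmented data (hypothetical rows)
dat <- tibble::tibble(
  Country = c("USA", "USA", "USA", "China", "China", NA),
  Phylum  = c("Bacillota", "Bacillota", "Bacteroidota",
              "Bacillota", "Actinomycetota", "Bacillota")
)

props <- dat |>
  filter(!is.na(Country)) |>          # drop rows without a country
  count(Country, Phylum, name = "n") |>
  group_by(Country) |>
  mutate(prop = n / sum(n)) |>        # proportion of each taxon per country
  ungroup()

# One row per country, one column per phylum; absent taxa become 0
wide <- props |>
  select(-n) |>
  pivot_wider(names_from = Phylum, values_from = prop, values_fill = 0)

print(wide)
```

Each row of `wide` sums to 1, so countries with very different numbers of MAGs become comparable.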

Proportion of taxon per country.

Variation between countries, e.g. in the order Bacteroidales.

Not tested for significance.

Phylum level.

The first two principal components together explain 100% of the variance.

PC1: Fusobacteria and Bacteroidota (order Bacteroidales).

Correlation with heatmap.
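
One possible reason for two components reaching 100%: if the PCA is run with three countries as observations, the centered data matrix has rank at most two, so a third component carries no variance. A small sketch with `prcomp` illustrates this; the proportions are hypothetical toy values, and the three-country setup is an assumption about how the PCA was run.

```r
# Toy proportions per country (hypothetical values, each row sums to 1)
props <- matrix(
  c(0.70, 0.20, 0.05, 0.05,
    0.55, 0.25, 0.15, 0.05,
    0.60, 0.10, 0.10, 0.20),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("USA", "China", "Denmark"),
    c("Bacillota", "Actinomycetota", "Bacteroidota", "Fusobacteria")
  )
)

pca <- prcomp(props, center = TRUE, scale. = FALSE)

# Share of variance per component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_two <- sum(var_explained[1:2])

print(round(var_explained, 4))
print(pca$rotation[, 1])  # loadings: which phyla drive PC1
```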

Test to see if there is a significant difference in the proportion of endometriosis-associated genomes between countries.

Fisher's exact test results
country1  country2  p_value  odds_ratio  CI_low  CI_high
USA       China      0.0205       2.670   1.113    7.779
USA       Denmark    0.1554       4.648   0.738  193.498
Denmark   China      1.0000       0.576   0.012    5.077
  • Significant difference between USA and China (p = 0.0205); the other two pairs do not differ significantly.
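
Pairwise tests of this kind can be run with base R's `fisher.test`. The sketch below uses hypothetical counts for three placeholder countries, so its numbers do not match the results reported here; the column names mirror the results table.

```r
# Hypothetical counts of associated ("yes") and non-associated ("no") MAGs
counts <- data.frame(
  country = c("A", "B", "C"),
  yes     = c(12, 5, 2),
  no      = c(88, 95, 18)
)

# All unordered pairs of countries
pairs <- combn(counts$country, 2, simplify = FALSE)

results <- do.call(rbind, lapply(pairs, function(p) {
  sub <- counts[match(p, counts$country), ]
  # 2 x 2 table: rows = the two countries, columns = yes / no
  ft <- fisher.test(as.matrix(sub[, c("yes", "no")]))
  data.frame(
    country1   = p[1],
    country2   = p[2],
    p_value    = ft$p.value,
    odds_ratio = unname(ft$estimate),
    CI_low     = ft$conf.int[1],
    CI_high    = ft$conf.int[2]
  )
}))

print(results)
```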

DISCUSSION, CONCLUSION & FUTURE PERSPECTIVES

  • The dataset contains high-quality MAGs with high completeness and low contamination

  • Taxonomy is dominated by a few phyla, with no distinct genomic patterns in endometriosis-associated bacterial groups

  • Geographic metadata are sparse and uneven, which limits country-based analyses.

  • Future work: improve clinical and geographic metadata

  • Future work: put more focus on theoretically correct analysis of proportions (compositional data analysis)